21 research outputs found

    Jabba: hybrid error correction for long sequencing reads using maximal exact matches

    Third generation sequencing platforms produce longer reads with higher error rates than second generation sequencing technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from long runtimes. Recently, a new hybrid error correction method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is that this mapping is constructed with a seed-and-extend methodology, using maximal exact matches as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of maximal exact matches in the context of third generation reads are presented.
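
The de Bruijn graph mentioned in this abstract can be illustrated with a minimal sketch. It is not Jabba's implementation: the function name `build_de_bruijn`, the choice of (k-1)-mers as nodes, and the toy reads are assumptions made only for illustration, and no error correction of the graph is attempted.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Build a toy de Bruijn graph (hypothetical helper, not Jabba's code).

    Nodes are (k-1)-mers; an edge u -> v exists when some k-mer in the
    reads has u as its prefix and v as its suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        # Slide a window of length k over the read.
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Two overlapping short reads yield a single linear path AC -> CG -> GT -> TA.
toy_graph = build_de_bruijn(["ACGT", "CGTA"], k=3)
```

In this convention a long read would be mapped onto a path through `toy_graph`; real tools additionally store reverse complements and k-mer coverage, which this sketch omits.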

    Computational assessment of the feasibility of protonation-based protein sequencing

    Recent advances in DNA sequencing methods have revolutionized biology by providing highly accurate reads, with high throughput or high read length. These read data are being used in many biological and medical applications. Modern DNA sequencing methods have no equivalent in protein sequencing, severely limiting the widespread application of protein data. Recently, several optical protein sequencing methods have been proposed that rely on the fluorescent labeling of amino acids. Here, we introduce the reprotonation-deprotonation protein sequencing method. Unlike other methods, this proposed technique relies on the measurement of an electrical signal and requires no fluorescent labeling. In reprotonation-deprotonation protein sequencing, the terminal amino acid is identified through its unique protonation signal, and by repeatedly cleaving the terminal amino acids one by one, each amino acid in the peptide is measured. By means of simulations, we show that, given a reference database of known proteins, reprotonation-deprotonation sequencing has the potential to correctly identify proteins in a sample. Our simulations provide target values for the signal-to-noise ratios (SNRs) that sensor devices need to attain in order to detect reprotonation-deprotonation events, as well as suitable pH values and required measurement times per amino acid. For instance, an SNR of 10 is required for a 61.71% proteome recovery rate with 100 ms measurement time per amino acid.

    OMSim : a simulator for optical map data

    Motivation: The Bionano Genomics platform allows for the optical detection of short sequence patterns in very long DNA molecules (up to 2.5 Mbp). Molecules with overlapping patterns can be assembled to generate a consensus optical map of the entire genome. In turn, these optical maps can be used to validate or improve de novo genome assembly projects or to detect large-scale structural variation in genomes. Simulated optical map data can assist in the development and benchmarking of tools that operate on those data, such as alignment and assembly software. Additionally, they can help to optimize the experimental setup for a genome of interest. Such a simulator is currently not available. Results: We have developed a simulator, OMSim, that produces synthetic optical map data that mimic real Bionano Genomics data. These simulated data have been tested for compatibility with the Bionano Genomics Irys software system and the Irys-scaffolding scripts. OMSim is capable of handling very large genomes (over 30 Gbp) with high throughput and low memory requirements.

    Jabba: hybrid error correction for long sequencing reads

    Background: Third generation sequencing platforms produce longer reads with higher error rates than second generation technologies. While the improved read length can provide useful information for downstream analysis, underlying algorithms are challenged by the high error rate. Error correction methods in which accurate short reads are used to correct noisy long reads appear to be attractive to generate high-quality long reads. Methods that align short reads to long reads do not optimally use the information contained in the second generation data, and suffer from long runtimes. Recently, a new hybrid error correction method has been proposed, where the second generation data is first assembled into a de Bruijn graph, on which the long reads are then aligned. Results: In this context we present Jabba, a hybrid method to correct long third generation reads by mapping them on a corrected de Bruijn graph that was constructed from second generation data. Unique to our method is the use of a pseudo alignment approach with a seed-and-extend methodology, using maximal exact matches (MEMs) as seeds. In addition to benchmark results, certain theoretical results concerning the possibilities and limitations of the use of MEMs in the context of third generation reads are presented. Conclusion: Jabba produces highly reliable corrected reads: almost all corrected reads align to the reference, and these alignments have a very high identity. Many of the aligned reads are error-free. Additionally, Jabba corrects reads using very little CPU time. From this we conclude that pseudo alignment with MEMs is a fast and reliable method to map long, highly erroneous sequences on a de Bruijn graph.
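
To make the notion of a maximal exact match (MEM) concrete, the following is a minimal sketch, assuming a plain linear reference string instead of the de Bruijn graph that Jabba actually maps onto. The function name `find_mems`, the `min_len` parameter, and the quadratic scan are illustrative assumptions; practical tools use an index such as a suffix array rather than this brute-force loop.

```python
def find_mems(read, ref, min_len):
    """Return all maximal exact matches as (read_pos, ref_pos, length).

    A match is maximal when it cannot be extended to the left or to the
    right without hitting a mismatch or the end of either sequence.
    Brute-force illustration only (hypothetical helper).
    """
    mems = []
    for i in range(len(read)):
        for j in range(len(ref)):
            if read[i] != ref[j]:
                continue
            # Skip starts that could be extended leftwards: not left-maximal.
            if i > 0 and j > 0 and read[i - 1] == ref[j - 1]:
                continue
            # Extend to the right as far as the characters agree.
            length = 0
            while (i + length < len(read) and j + length < len(ref)
                   and read[i + length] == ref[j + length]):
                length += 1
            if length >= min_len:
                mems.append((i, j, length))
    return mems


# A single substitution at read position 4 splits the match into shorter MEMs.
seeds = find_mems("ACGTTACG", "ACGTAACG", min_len=3)
```

With a high error rate, long MEMs become rare, which is why the minimum seed length matters for noisy third generation reads.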

    BrownieAligner : accurate alignment of Illumina sequencing data to de Bruijn graphs

    Background: Aligning short reads to a reference genome is an important task in many genome analysis pipelines. This task is computationally more complex when the reference genome is provided in the form of a de Bruijn graph instead of a linear sequence string. Results: We present a branch and bound alignment algorithm that uses the seed-and-extend paradigm to accurately align short Illumina reads to a graph. Given a seed, the algorithm greedily explores all branches of the tree until the optimal alignment path is found. To reduce the search space we compute upper bounds to the alignment score for each branch and discard the branch if it cannot improve the best solution found so far. Additionally, by using a two-pass alignment strategy and a higher-order Markov model, paths in the de Bruijn graph that do not represent a subsequence in the original reference genome are discarded from the search procedure. Conclusions: BrownieAligner is applied to both synthetic and real datasets. It generally outperforms other state-of-the-art tools in terms of accuracy, while having similar runtime and memory requirements. Our results show that using the higher-order Markov model in BrownieAligner improves the accuracy, while the branch and bound algorithm reduces runtime. BrownieAligner is written in standard C++11 and released under the GPL license. BrownieAligner relies on multithreading to take advantage of multi-core/multi-CPU systems.
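
The branch and bound idea described above can be sketched as follows. This is a simplified illustration, not BrownieAligner's algorithm: it assumes substitutions only (no indels), a +1/-1 scoring scheme, a graph given as a dictionary from node string to successor node strings, and it omits the two-pass strategy and the Markov model. All names (`extend_seed`, `graph`, `suffix`) are hypothetical.

```python
def extend_seed(graph, seed, suffix, match=1, mismatch=-1):
    """Extend an alignment from `seed` along graph paths to cover `suffix`.

    Each step to a successor node consumes one read character, compared
    against the last character of that node. Returns (score, path).
    """
    best_score = float("-inf")
    best_path = []

    def dfs(node, pos, score, path):
        nonlocal best_score, best_path
        remaining = len(suffix) - pos
        successors = graph.get(node, ())
        if remaining == 0 or not successors:
            # End of read or dead end: unaligned characters count as mismatches.
            final = score + mismatch * remaining
            if final > best_score:
                best_score, best_path = final, list(path)
            return
        # Upper bound: every remaining character matches. Prune if hopeless.
        if score + match * remaining <= best_score:
            return
        # Greedy order: try successors that match the next character first.
        for nxt in sorted(successors, key=lambda n: n[-1] != suffix[pos]):
            step = match if nxt[-1] == suffix[pos] else mismatch
            path.append(nxt)
            dfs(nxt, pos + 1, score + step, path)
            path.pop()

    dfs(seed, 0, 0, [seed])
    return best_score, best_path


toy = {"AC": ["CG", "CT"], "CG": ["GT"], "CT": ["TT"], "GT": [], "TT": []}
score, path = extend_seed(toy, "AC", "TT")
```

Exploring the most promising branch first tends to raise the best score early, so the upper bound test can discard more of the remaining branches.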

    Illumina error correction near highly repetitive DNA regions improves de novo genome assembly

    BACKGROUND: Several standalone error correction tools have been proposed to correct sequencing errors in Illumina data in order to facilitate de novo genome assembly. However, in a recent survey, we showed that state-of-the-art assemblers often did not benefit from this pre-correction step. We found that many error correction tools introduce new errors in reads that overlap highly repetitive DNA regions such as low-complexity patterns or short homopolymers, ultimately leading to a more fragmented assembly. RESULTS: We propose BrownieCorrector, an error correction tool for Illumina sequencing data that focuses on the correction of only those reads that overlap short DNA patterns that are highly repetitive in the genome. BrownieCorrector extracts all reads that contain such a pattern and clusters them into different groups using a community detection algorithm that takes into account both the sequence similarity between overlapping reads and their respective paired-end reads. Each cluster holds reads that originate from the same genomic region and hence each cluster can be corrected individually, thus providing a consistent correction for all reads within that cluster. CONCLUSIONS: BrownieCorrector is benchmarked using six real Illumina datasets for different eukaryotic genomes. The prior use of BrownieCorrector improves assembly results over the use of uncorrected reads in all cases. In comparison with other error correction tools, BrownieCorrector leads to the best assembly results in most cases even though less than 2% of the reads within a dataset are corrected. Additionally, we investigate the impact of error correction on hybrid assembly where the corrected Illumina reads are supplemented with PacBio data. Our results confirm that BrownieCorrector improves the quality of hybrid genome assembly as well. BrownieCorrector is written in standard C++11 and released under the GPL license. BrownieCorrector relies on multithreading to take advantage of multi-core/multi-CPU systems. The source code is available at https://github.com/biointec/browniecorrector.
    Funding: The Research Foundation - Flanders (FWO) (G0C3914N). Computational resources and services were provided by the Flemish Supercomputer Center, funded by Ghent University, the Hercules Foundation and the Flemish Government – EWI.
    https://bmcbioinformatics.biomedcentral.com

    Evaluation of the impact of Illumina error correction tools on de novo genome assembly

    BACKGROUND: Recently, many standalone applications have been proposed to correct sequencing errors in Illumina data. The key idea is that downstream analysis tools such as de novo genome assemblers benefit from a reduced error rate in the input data. Surprisingly, a systematic validation of this assumption using state-of-the-art assembly methods is lacking, even for recently published methods. RESULTS: For twelve recent Illumina error correction tools (EC tools) we evaluated both their ability to correct sequencing errors and their ability to improve de novo genome assembly in terms of contig size and accuracy. CONCLUSIONS: We confirm that most EC tools reduce the number of errors in sequencing data without introducing many new errors. However, we found that many EC tools suffer from poor performance in certain sequence contexts such as regions with low coverage or regions that contain short repeated or low-complexity sequences. Reads overlapping such regions are often ill-corrected in an inconsistent manner, leading to breakpoints in the resulting assemblies that are not present in assemblies obtained from uncorrected data. Resolving this systematic flaw in future EC tools could greatly improve the applicability of such tools.
    Additional file 1: Supplementary Data. Evaluation of the impact of Illumina error correction tools on de novo genome assembly.
    Funding: The Research Foundation - Flanders (FWO) (G0C3914N).
    http://www.biomedcentral.com/bmcbioinformatics

    Genome-wide expression and network analyses of mutants in key brassinosteroid signaling genes

    ADDITIONAL FILE 1. List of differentially expressed genes as compared to WS2 in each line and group membership.
    ADDITIONAL FILE 2. List of marker genes being differentially expressed upon addition of external BRs or in line with a gain of mutation in BR signaling genes in at least 5 studies.
    ADDITIONAL FILE 3. GO enrichment for genes exclusively differentially expressed in each suppressor.
    ADDITIONAL FILE: Fig. S1. The T-DNA insertion site for bri1–1D (A), and microscopic images of 7-day old hypocotyl cells for WS2 (B), bri1–5 (C), bri1–5/bak1–1D (D), bri1–5/bri1–1D (E), bri1–5/brs1–1D (F).
    Fig. S2. PCA plot for assessing the reproducibility of the gene expression dataset. Samples taken from the same genotype are represented in the same color. The plot indicates high consistency between replicate samples as they are located close to each other when plotted on the first and second principal components.
    Fig. S3. RT-qPCR results for relative expression of selected genes and their corresponding values from microarray analysis. The values represent the log2 of relative expression (sample1/sample2). Rows indicate gene names and columns show the comparison between the indicated lines. Columns with a pink header represent the RT-qPCR values, and columns with a yellow header are microarray measurements. The red color on the heatmap indicates that the gene has been up-regulated in sample1 as compared to sample2, while blue indicates down-regulation.
    Fig. S4. Comparing genome-wide expression impact between bri1–5 suppressor lines.
    Fig. S5. Heatmap of expression of the marker genes whose up/down regulation was confirmed by at least 5 independent references and whose expression was also affected in the bri1–5 line of our study. For each line, the row-scaled normalized expression data of the 3 biological replicates are shown as adjacent columns. In each row the red color gradient indicates higher expression of the gene compared to other samples while blue indicates lower expression.
    Fig. S6. Pathway analysis (MapMan metabolism) showing for each mutant line the expression changes compared to WS2. Panel A: bri1–5, Panel B: bri1–5/bri1–1D, Panel C: bri1–5/brs1–1D, Panel D: bri1–5/bak1–1D.
    Fig. S7. Pathway analysis (MapMan: large enzyme families) showing for each mutant line the expression changes compared to WS2. Panel A: bri1–5, Panel B: bri1–5/bri1–1D, Panel C: bri1–5/brs1–1D, Panel D: bri1–5/bak1–1D.
    Fig. S8. Pathway analysis (MapMan: gene regulation) showing for each mutant line the expression changes compared to WS2. Panel A: bri1–5, Panel B: bri1–5/bri1–1D, Panel C: bri1–5/brs1–1D, Panel D: bri1–5/bak1–1D.
    Fig. S9. Expression pattern in each mutant line of genes related to ABA signaling, glutathione metabolism, and ion-related homeostasis as discussed in the main text. Mutant lines are represented on the x-axis. The y-axis indicates the log2 normalized expression value of the gene.
    Table S1. RT-qPCR test of log-fold change (log-FC) of the genes that are overexpressed by activation-tagging in the suppressors at the 7-day seedling stage.
    Table S2. Summary of the most significant results obtained by MapMan pathway analysis (metabolism, regulation, and large-enzyme families overview). Left column: enriched pathways; entries provide for each line the degree to which the pathway is enriched. P-values are FDR corrected using Benjamini-Hochberg.
    Table S3. Designed primers for RT-qPCR.
    BACKGROUND: Brassinosteroid (BR) signaling regulates plant growth and development in concert with other signaling pathways. Although many genes have been identified that play a role in BR signaling, the biological and functional consequences of disrupting those key BR genes still require detailed investigation. RESULTS: Here we performed phenotypic and transcriptomic comparisons of A. thaliana lines carrying a loss-of-function mutation in the BRI1 gene, bri1–5, that exhibits a dwarf phenotype and its three activation-tag suppressor lines that were able to partially revert the bri1–5 mutant phenotype to a WS2 phenotype, namely bri1–5/bri1–1D, bri1–5/brs1–1D, and bri1–5/bak1–1D. Of the three investigated bri1–5 suppressors, bri1–5/bak1–1D was the most effective suppressor at the transcriptional level. All three bri1–5 suppressors showed altered expression of the genes in the abscisic acid (ABA) signaling pathway, indicating that ABA likely contributes to the partial recovery of the wild-type phenotype in these bri1–5 suppressors. Network analysis revealed crosstalk between BR and other phytohormone signaling pathways, suggesting that interference with one hormone signaling pathway affects other hormone signaling pathways. In addition, differential expression analysis suggested the existence of a strong negative feedback from BR signaling on BR biosynthesis and also predicted that BRS1, rather than being directly involved in signaling, might be responsible for providing an optimal environment for the interaction between BRI1 and its ligand. CONCLUSIONS: Our study provides insights into the molecular mechanisms and functions of key brassinosteroid (BR) signaling genes, especially BRS1.
    Funding: The National Basic Research Program of China, Youth Innovation Promotion Association of the Chinese Academy of Sciences, the Ministry of Science, Research and Technology, Iran, the Fonds Wetenschappelijk Onderzoek-Vlaanderen (FWO) and UGent Bijzonder onderzoeksfonds.
    http://www.biomedcentral.com/bmcgenomics